Skip to content

Fix ti_skip_downstream overwriting RUNNING tasks to SKIPPED#63266

Draft
sam-dumont wants to merge 2 commits intoapache:mainfrom
sam-dumont:fix/ti-skip-downstream-state-guard
Draft

Fix ti_skip_downstream overwriting RUNNING tasks to SKIPPED#63266
sam-dumont wants to merge 2 commits intoapache:mainfrom
sam-dumont:fix/ti-skip-downstream-state-guard

Conversation

@sam-dumont
Copy link
Contributor

ti_skip_downstream() issues an UPDATE filtered by (dag_id, run_id, task_id, map_index) without a state guard. When a BranchOperator on one scheduler decides to skip downstream tasks, the UPDATE can overwrite a task already RUNNING on a worker. The worker's next heartbeat returns 409 with current_state: skipped, killing the task mid-execution.

This is a companion fix to #60330, which guards schedule_tis() against the same class of race condition. Different code path (Execution API routes vs dagrun.py), same root cause : unguarded bulk UPDATEs on TI state.

Production data (12 days, 5 schedulers, ~500 concurrent workers)

We deployed both fixes as monkey patches on our prod cluster and monitored 409 heartbeat errors via CloudWatch :

Before any fix       14-169 errors/day
After schedule_tis   3-4/day (all current_state: skipped)
After both fixes     0 errors for 18+ hours
Metric Before fixes After schedule_tis only After both fixes
Total 409s/day 14-169 3-4 0
current_state: scheduled present 0 0
current_state: failed 47/day 0 0
current_state: skipped 8/day 2-5/day 0

Fix

Add skippable_state_clause to the UPDATE's WHERE clause :

skippable_state_clause = or_(
    TI.state.is_(None),
    TI.state.not_in([RUNNING, SUCCESS, FAILED]),
)

The or_(IS NULL, NOT IN) pattern handles SQL NULL semantics : NULL NOT IN (...) evaluates to NULL (falsy), so tasks with state=None need an explicit IS NULL check to remain skippable.

QUEUED is intentionally NOT guarded : a QUEUED task hasn't started executing yet, so the BranchOperator's decision should take priority. The worker pod will get a benign 409 on PATCH /run and exit cleanly. Blocking QUEUED would cause a semantic error where the wrong branch executes.

Tests

5 regression tests in TestTISkipDownstreamRaceCondition :

  • RUNNING / SUCCESS / FAILED tasks protected from overwrite (parametrized)
  • QUEUED task correctly skipped (BranchOperator decision wins over queue)
  • None-state task still correctly skipped (happy path)

related: #59378

related: #60330

related: #57618


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (claude-opus-4-6)

Generated-by: Claude Code (claude-opus-4-6) following the guidelines

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:task-sdk labels Mar 10, 2026
In HA deployments, ti_skip_downstream() issues a bulk UPDATE without
a state guard. When a BranchOperator decides to skip downstream tasks,
it can overwrite a task already RUNNING on a worker to SKIPPED, causing
a 409 heartbeat conflict that kills the task mid-execution.

Add a skippable_state_clause to the UPDATE WHERE clause so RUNNING,
SUCCESS, and FAILED tasks are never overwritten to SKIPPED.

QUEUED tasks are intentionally allowed to be skipped: no work has been
done yet and the BranchOperator's decision should take priority. The
worker pod will get a benign 409 on PATCH /run and exit cleanly.

closes: apache#59378
@sam-dumont sam-dumont force-pushed the fix/ti-skip-downstream-state-guard branch from f4a3ded to 589633a Compare March 10, 2026 14:44
@potiuk
Copy link
Member

potiuk commented Mar 11, 2026

@sam-dumont This PR has been converted to draft because it does not yet meet our Pull Request quality criteria.

Issues found:

  • Pre-commit / static checks: Failing: CI image checks / Static checks. Run prek run --from-ref main locally to find and fix issues. See Pre-commit / static checks docs.
  • mypy (type checking): Failing: CI image checks / MyPy checks (mypy-airflow-core), CI image checks / MyPy checks (mypy-task-sdk). Run prek --stage manual mypy-airflow-core --all-files && prek --stage manual mypy-task-sdk --all-files locally to reproduce. You need breeze ci-image build --python 3.10 for Docker-based mypy. See mypy (type checking) docs.
  • Provider tests: Failing: provider distributions tests / Compat 2.11.1:P3.10:, Postgres tests: providers / DB-prov:Postgres:14:3.10:-amazon,celer...standard, MySQL tests: providers / DB-prov:MySQL:8.0:3.10:-amazon,celer...standard, Sqlite tests: providers / DB-prov:Sqlite:3.10:-amazon,celer...standard, Non-DB tests: providers / Non-DB-prov::3.10:-amazon,celer...standard (+7 more). Run provider tests with breeze run pytest <provider-test-path> -xvs. See Provider tests docs.
  • Other failing CI checks: Failing: CI image checks / Test Python API client, Postgres tests: core / DB-core:Postgres:14:3.10:API...Serialization, MySQL tests: core / DB-core:MySQL:8.0:3.10:API...Serialization, Sqlite tests: core / DB-core:Sqlite:3.10:API...Serialization, Non-DB tests: core / Non-DB-core::3.10:API...Serialization (+8 more). Run prek run --from-ref main locally to reproduce. See static checks docs.

Note: Your branch is 45 commits behind main. Some check failures may be caused by changes in the base branch rather than by your PR. Please rebase your branch and push again to get up-to-date CI results.

What to do next:

  • The comment informs you what you need to do.
  • Fix each issue, then mark the PR as "Ready for review" in the GitHub UI - but only after making sure that all the issues are fixed.
  • Maintainers will then proceed with a normal review.

Converting a PR to draft is not a rejection — it is an invitation to bring the PR up to the project's standards so that maintainer review time is spent productively. If you have questions, feel free to ask on the Airflow Slack.

@potiuk potiuk marked this pull request as draft March 11, 2026 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:API Airflow's REST/HTTP API area:task-sdk

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants